{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 泰坦尼克号生存预测——模型选择与训练SVM与标准化KNN"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "LIJIANcoder97"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### 上一次作业已经使用KNN完成了模型训练，精度为0.72，但是还可以更高吗？我们知道KNN的预测方法依靠与大量的数据，所以如果数据之间单位差距过大，拟合效果就会差，本次使用SKlearn标准化的方法，对数据i进行标准化，再用KNN与SVM进行训练。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from matplotlib import pyplot as plt\n",
    "import seaborn as sns\n",
    "from sklearn.preprocessing import LabelEncoder"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 调用sklearn LabelEncoder()，进行编码，文本转数字。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def encode_features(data_set, feature_names):\n",
    "    for feature_name in feature_names:\n",
    "        le = LabelEncoder()\n",
    "        le.fit(data_set[feature_name])\n",
    "        encoded_column = le.transform(data_set[feature_name])\n",
    "        data_set[feature_name] = encoded_column\n",
    "    return data_set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "Taitan_train=pd.read_csv('dataset/titanic/train.csv')\n",
    "Taitan_test=pd.read_csv('dataset/titanic/test.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "丢弃'Cabin','Name','Ticket'属性列"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "Taitan_train=Taitan_train.drop(['Cabin','Name','Ticket'],axis=1)\n",
    "Taitan_test=Taitan_test.drop(['Cabin','Name','Ticket'],axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 对sex embark属性进行编码"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "le = LabelEncoder()\n",
    "le.fit(Taitan_train['Sex'])\n",
    "Taitan_train['Sex']=le.transform(Taitan_train['Sex']) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "le = LabelEncoder()\n",
    "le.fit(Taitan_test['Sex'])\n",
    "Taitan_test['Sex']=le.transform(Taitan_test['Sex']) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Embarked 属性有缺失值，要先进行处理，缺失值替换为最多的登船口"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "Taitan_train['Embarked'] = Taitan_train['Embarked'].fillna('S')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['C', 'Q', 'S'], dtype=object)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "l = LabelEncoder()\n",
    "l.fit(Taitan_train['Embarked'])\n",
    "l.classes_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "Taitan_train['Embarked']=l.transform(Taitan_train['Embarked']) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "Taitan_test['Embarked'] = Taitan_test['Embarked'].fillna('S')\n",
    "l = LabelEncoder()\n",
    "l.fit(Taitan_test['Embarked'])\n",
    "Taitan_test['Embarked']=l.transform(Taitan_test['Embarked']) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 编码完毕\n",
    "### 对Age 属性进行处理，缺失值替换"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "Taitan_train['Age'] = Taitan_train['Age'].fillna(Taitan_train['Age'].median())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "Taitan_test['Age'] = Taitan_test['Age'].fillna(Taitan_test['Age'].median())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 创建family属性代表乘客的亲人数目探索其与存活率的关系"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "Taitan_train['Family']=Taitan_train['SibSp']+Taitan_train['Parch']+1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "Taitan_test['Family']=Taitan_test['SibSp']+Taitan_test['Parch']+1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "Taitan_test['Fare'] = Taitan_train['Fare'].fillna(Taitan_train['Fare'].median())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在已经有了清理好的数据集了"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Embarked</th>\n",
       "      <th>Family</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>22.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>7.2500</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>38.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>71.2833</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>7.9250</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>53.1000</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>35.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>8.0500</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   PassengerId  Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked  \\\n",
       "0            1         0       3    1  22.0      1      0   7.2500         2   \n",
       "1            2         1       1    0  38.0      1      0  71.2833         0   \n",
       "2            3         1       3    0  26.0      0      0   7.9250         2   \n",
       "3            4         1       1    0  35.0      1      0  53.1000         2   \n",
       "4            5         0       3    1  35.0      0      0   8.0500         2   \n",
       "\n",
       "   Family  \n",
       "0       2  \n",
       "1       2  \n",
       "2       1  \n",
       "3       2  \n",
       "4       1  "
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Taitan_train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Embarked</th>\n",
       "      <th>Family</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>892</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>34.5</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>7.2500</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>893</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>47.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>71.2833</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>894</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>62.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>7.9250</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>895</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>27.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>53.1000</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>896</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>22.0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>8.0500</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   PassengerId  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked  Family\n",
       "0          892       3    1  34.5      0      0   7.2500         1       1\n",
       "1          893       3    0  47.0      1      0  71.2833         2       2\n",
       "2          894       2    1  62.0      0      0   7.9250         1       1\n",
       "3          895       3    1  27.0      0      0  53.1000         2       1\n",
       "4          896       3    0  22.0      1      1   8.0500         2       3"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Taitan_resout=Taitan_test\n",
    "Taitan_test.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "Taitan_p=Taitan_train['Survived']\n",
    "Taitan_train=Taitan_train.drop(['Survived'],axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 数据标准化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "standa = StandardScaler()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "D:\\Anaconda3\\lib\\site-packages\\sklearn\\preprocessing\\data.py:625: DataConversionWarning: Data with input dtype int32, int64, float64 were all converted to float64 by StandardScaler.\n",
      "  return self.partial_fit(X, y)\n",
      "D:\\Anaconda3\\lib\\site-packages\\ipykernel_launcher.py:2: DataConversionWarning: Data with input dtype int32, int64, float64 were all converted to float64 by StandardScaler.\n",
      "  \n"
     ]
    }
   ],
   "source": [
    "standa.fit(Taitan_train)\n",
    "Taitan_train=standa.transform(Taitan_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "D:\\Anaconda3\\lib\\site-packages\\sklearn\\preprocessing\\data.py:625: DataConversionWarning: Data with input dtype int32, int64, float64 were all converted to float64 by StandardScaler.\n",
      "  return self.partial_fit(X, y)\n",
      "D:\\Anaconda3\\lib\\site-packages\\ipykernel_launcher.py:2: DataConversionWarning: Data with input dtype int32, int64, float64 were all converted to float64 by StandardScaler.\n",
      "  \n"
     ]
    }
   ],
   "source": [
    "standa.fit(Taitan_test)\n",
    "Taitan_test=standa.transform(Taitan_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 模型训练\n",
    "## KNN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.metrics import classification_report"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "导入knn和准确率评估方法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "首先通过网格搜索选出最合适的超参数，将训练集分为两部分一部分用于训练，一部分用于验证。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_train=Taitan_train[0:712]\n",
    "x_test=Taitan_train[712: ]\n",
    "y_train=Taitan_p[0:712]\n",
    "y_test=Taitan_p[712: ]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "训练集分割完毕开始训练,使用网格搜索"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "k= 2 ;p= 1 ;w= uniform ;结果:               precision    recall  f1-score   support\n",
      "\n",
      "           0       0.82      0.90      0.86       115\n",
      "           1       0.78      0.66      0.71        64\n",
      "\n",
      "   micro avg       0.81      0.81      0.81       179\n",
      "   macro avg       0.80      0.78      0.79       179\n",
      "weighted avg       0.81      0.81      0.81       179\n",
      "\n",
      "k= 2 ;p= 1 ;w= distance ;结果:               precision    recall  f1-score   support\n",
      "\n",
      "           0       0.86      0.78      0.82       115\n",
      "           1       0.66      0.77      0.71        64\n",
      "\n",
      "   micro avg       0.78      0.78      0.78       179\n",
      "   macro avg       0.76      0.77      0.76       179\n",
      "weighted avg       0.79      0.78      0.78       179\n",
      "\n",
      "k= 2 ;p= 2 ;w= uniform ;结果:               precision    recall  f1-score   support\n",
      "\n",
      "           0       0.82      0.90      0.86       115\n",
      "           1       0.78      0.66      0.71        64\n",
      "\n",
      "   micro avg       0.81      0.81      0.81       179\n",
      "   macro avg       0.80      0.78      0.79       179\n",
      "weighted avg       0.81      0.81      0.81       179\n",
      "\n",
      "k= 2 ;p= 2 ;w= distance ;结果:               precision    recall  f1-score   support\n",
      "\n",
      "           0       0.86      0.80      0.83       115\n",
      "           1       0.68      0.77      0.72        64\n",
      "\n",
      "   micro avg       0.79      0.79      0.79       179\n",
      "   macro avg       0.77      0.78      0.77       179\n",
      "weighted avg       0.80      0.79      0.79       179\n",
      "\n",
      "k= 3 ;p= 1 ;w= uniform ;结果:               precision    recall  f1-score   support\n",
      "\n",
      "           0       0.88      0.83      0.86       115\n",
      "           1       0.73      0.80      0.76        64\n",
      "\n",
      "   micro avg       0.82      0.82      0.82       179\n",
      "   macro avg       0.80      0.82      0.81       179\n",
      "weighted avg       0.83      0.82      0.82       179\n",
      "\n",
      "k= 3 ;p= 1 ;w= distance ;结果:               precision    recall  f1-score   support\n",
      "\n",
      "           0       0.87      0.83      0.85       115\n",
      "           1       0.72      0.78      0.75        64\n",
      "\n",
      "   micro avg       0.82      0.82      0.82       179\n",
      "   macro avg       0.80      0.81      0.80       179\n",
      "weighted avg       0.82      0.82      0.82       179\n",
      "\n",
      "k= 3 ;p= 2 ;w= uniform ;结果:               precision    recall  f1-score   support\n",
      "\n",
      "           0       0.87      0.84      0.86       115\n",
      "           1       0.74      0.78      0.76        64\n",
      "\n",
      "   micro avg       0.82      0.82      0.82       179\n",
      "   macro avg       0.80      0.81      0.81       179\n",
      "weighted avg       0.82      0.82      0.82       179\n",
      "\n",
      "k= 3 ;p= 2 ;w= distance ;结果:               precision    recall  f1-score   support\n",
      "\n",
      "           0       0.87      0.85      0.86       115\n",
      "           1       0.74      0.77      0.75        64\n",
      "\n",
      "   micro avg       0.82      0.82      0.82       179\n",
      "   macro avg       0.80      0.81      0.81       179\n",
      "weighted avg       0.82      0.82      0.82       179\n",
      "\n",
      "k= 4 ;p= 1 ;w= uniform ;结果:               precision    recall  f1-score   support\n",
      "\n",
      "           0       0.83      0.88      0.86       115\n",
      "           1       0.76      0.69      0.72        64\n",
      "\n",
      "   micro avg       0.81      0.81      0.81       179\n",
      "   macro avg       0.80      0.78      0.79       179\n",
      "weighted avg       0.81      0.81      0.81       179\n",
      "\n",
      "k= 4 ;p= 1 ;w= distance ;结果:               precision    recall  f1-score   support\n",
      "\n",
      "           0       0.87      0.84      0.86       115\n",
      "           1       0.74      0.78      0.76        64\n",
      "\n",
      "   micro avg       0.82      0.82      0.82       179\n",
      "   macro avg       0.80      0.81      0.81       179\n",
      "weighted avg       0.82      0.82      0.82       179\n",
      "\n",
      "k= 4 ;p= 2 ;w= uniform ;结果:               precision    recall  f1-score   support\n",
      "\n",
      "           0       0.84      0.90      0.87       115\n",
      "           1       0.79      0.70      0.74        64\n",
      "\n",
      "   micro avg       0.83      0.83      0.83       179\n",
      "   macro avg       0.82      0.80      0.81       179\n",
      "weighted avg       0.82      0.83      0.82       179\n",
      "\n",
      "k= 4 ;p= 2 ;w= distance ;结果:               precision    recall  f1-score   support\n",
      "\n",
      "           0       0.89      0.83      0.86       115\n",
      "           1       0.73      0.81      0.77        64\n",
      "\n",
      "   micro avg       0.83      0.83      0.83       179\n",
      "   macro avg       0.81      0.82      0.82       179\n",
      "weighted avg       0.83      0.83      0.83       179\n",
      "\n",
      "k= 5 ;p= 1 ;w= uniform ;结果:               precision    recall  f1-score   support\n",
      "\n",
      "           0       0.88      0.83      0.86       115\n",
      "           1       0.73      0.80      0.76        64\n",
      "\n",
      "   micro avg       0.82      0.82      0.82       179\n",
      "   macro avg       0.80      0.82      0.81       179\n",
      "weighted avg       0.83      0.82      0.82       179\n",
      "\n",
      "k= 5 ;p= 1 ;w= distance ;结果:               precision    recall  f1-score   support\n",
      "\n",
      "           0       0.88      0.85      0.87       115\n",
      "           1       0.75      0.80      0.77        64\n",
      "\n",
      "   micro avg       0.83      0.83      0.83       179\n",
      "   macro avg       0.82      0.82      0.82       179\n",
      "weighted avg       0.84      0.83      0.83       179\n",
      "\n",
      "k= 5 ;p= 2 ;w= uniform ;结果:               precision    recall  f1-score   support\n",
      "\n",
      "           0       0.88      0.86      0.87       115\n",
      "           1       0.76      0.80      0.78        64\n",
      "\n",
      "   micro avg       0.84      0.84      0.84       179\n",
      "   macro avg       0.82      0.83      0.83       179\n",
      "weighted avg       0.84      0.84      0.84       179\n",
      "\n",
      "k= 5 ;p= 2 ;w= distance ;结果:               precision    recall  f1-score   support\n",
      "\n",
      "           0       0.89      0.85      0.87       115\n",
      "           1       0.75      0.81      0.78        64\n",
      "\n",
      "   micro avg       0.84      0.84      0.84       179\n",
      "   macro avg       0.82      0.83      0.83       179\n",
      "weighted avg       0.84      0.84      0.84       179\n",
      "\n",
      "k= 6 ;p= 1 ;w= uniform ;结果:               precision    recall  f1-score   support\n",
      "\n",
      "           0       0.81      0.95      0.88       115\n",
      "           1       0.87      0.61      0.72        64\n",
      "\n",
      "   micro avg       0.83      0.83      0.83       179\n",
      "   macro avg       0.84      0.78      0.80       179\n",
      "weighted avg       0.83      0.83      0.82       179\n",
      "\n",
      "k= 6 ;p= 1 ;w= distance ;结果:               precision    recall  f1-score   support\n",
      "\n",
      "           0       0.87      0.85      0.86       115\n",
      "           1       0.74      0.77      0.75        64\n",
      "\n",
      "   micro avg       0.82      0.82      0.82       179\n",
      "   macro avg       0.80      0.81      0.81       179\n",
      "weighted avg       0.82      0.82      0.82       179\n",
      "\n",
      "k= 6 ;p= 2 ;w= uniform ;结果:               precision    recall  f1-score   support\n",
      "\n",
      "           0       0.85      0.90      0.87       115\n",
      "           1       0.79      0.72      0.75        64\n",
      "\n",
      "   micro avg       0.83      0.83      0.83       179\n",
      "   macro avg       0.82      0.81      0.81       179\n",
      "weighted avg       0.83      0.83      0.83       179\n",
      "\n",
      "k= 6 ;p= 2 ;w= distance ;结果:               precision    recall  f1-score   support\n",
      "\n",
      "           0       0.89      0.83      0.86       115\n",
      "           1       0.73      0.81      0.77        64\n",
      "\n",
      "   micro avg       0.83      0.83      0.83       179\n",
      "   macro avg       0.81      0.82      0.82       179\n",
      "weighted avg       0.83      0.83      0.83       179\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for k in [2,3,4,5,6]:\n",
    "    for p in [1,2]: ##1代表明可夫斯基距离，2代表欧拉距离\n",
    "        for w in ['uniform','distance']: ##uniform 代表不考虑距离权重\n",
    "            n= KNeighborsClassifier(n_neighbors=k,weights=w,p=p)\n",
    "            n.fit(x_train,y_train)##构造模型！！\n",
    "            y_p=n.predict(x_test)##预测\n",
    "            print('k=',k,';p=',p,';w=',w,';结果:',classification_report(y_test,y_p))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "准确率的参考属性时micro"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "震惊！！，使用标准化数据准确路提升10个百分点，准确率最高为 0.84，分别是 5，2，uniform；5，2，distance；\n",
    "选择5，2，distance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "n= KNeighborsClassifier(n_neighbors=5,weights='distance',p=2)\n",
    "n.fit(Taitan_train,Taitan_p)##构造模型！！\n",
    "y_re=n.predict(Taitan_test)##预测"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0,\n",
       "       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,\n",
       "       1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,\n",
       "       1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,\n",
       "       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,\n",
       "       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,\n",
       "       1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,\n",
       "       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,\n",
       "       1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0,\n",
       "       0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,\n",
       "       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,\n",
       "       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1,\n",
       "       0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,\n",
       "       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1,\n",
       "       0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0,\n",
       "       1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,\n",
       "       0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0,\n",
       "       1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,\n",
       "       0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1],\n",
       "      dtype=int64)"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "Taitan_resout=pd.read_csv('dataset/titanic/gender_submission.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "Taitan_resout['Survived']=y_re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "Taitan_re=Taitan_resout"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "Taitan_re.to_csv('dataset/titanic/RE-stknn.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Survived</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>892</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>893</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>894</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>895</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>896</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   PassengerId  Survived\n",
       "0          892         0\n",
       "1          893         0\n",
       "2          894         0\n",
       "3          895         0\n",
       "4          896         0"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Taitan_re.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Survived</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>418.000000</td>\n",
       "      <td>418.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>1100.500000</td>\n",
       "      <td>0.392344</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>120.810458</td>\n",
       "      <td>0.488858</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>892.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>996.250000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>1100.500000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>1204.750000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>1309.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       PassengerId    Survived\n",
       "count   418.000000  418.000000\n",
       "mean   1100.500000    0.392344\n",
       "std     120.810458    0.488858\n",
       "min     892.000000    0.000000\n",
       "25%     996.250000    0.000000\n",
       "50%    1100.500000    0.000000\n",
       "75%    1204.750000    1.000000\n",
       "max    1309.000000    1.000000"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Taitan_re.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 分析"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "标准化数据knn的训练与预测完成了，理想情况下，测试集的预测结果准确率也会达到0.84，相比之前的0.72取得了巨大的提高。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面将使用SVM模型进行预测"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import accuracy_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.svm import SVC"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wall time: 6min 29s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "max=0\n",
    "c_a=0\n",
    "degree_a=0\n",
    "kernel_a='a'\n",
    "for c in range(1,100):\n",
    "    for degree in range(3,20): \n",
    "        for kernel in ['rbf','poly','linear','sigmoid']: ##uniform 代表不考虑距离权重\n",
    "           clf=SVC(C=c,degree=degree,kernel=kernel,gamma='auto')\n",
    "           clf.fit(x_train,y_train) \n",
    "           y_p=clf.predict(x_test)\n",
    "           a=accuracy_score(y_test,y_p,)\n",
    "           if accuracy_score(y_test,y_p,)>max :\n",
    "              max=a\n",
    "              c_a=c\n",
    "              degree_a=degree\n",
    "              kernel_a=kernel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "准确度 0.8715083798882681  c 3  degree 3  kernel rbf\n"
     ]
    }
   ],
   "source": [
    "print('准确度',max,' c',c_a,' degree',degree_a,' kernel',kernel_a)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [],
   "source": [
    "clf=SVC(C=3,degree=3,kernel='rbf',gamma='auto')\n",
    "clf.fit(Taitan_train,Taitan_p)\n",
    "y_p=clf.predict(Taitan_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [],
   "source": [
    "Taitan_resout['Survived']=y_p"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [],
   "source": [
    "Taitan_re=Taitan_resout"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [],
   "source": [
    "Taitan_re.to_csv('dataset/titanic/RE - SVM.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Survived</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>892</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>893</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>894</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>895</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>896</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   PassengerId  Survived\n",
       "0          892         0\n",
       "1          893         0\n",
       "2          894         0\n",
       "3          895         0\n",
       "4          896         0"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Taitan_re.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### NOW 我们已经用标准化得KNN 与SVM模型预测完毕了，准确率分别为0.84和0.87."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
